Increasing practical interest has been shown in regression problems where the errors, or
disturbances, are centred in a way that reflects particular characteristics of the mechanism that
generated the data. In economics this occurs in problems involving data on markets, productivity 
and auctions, where it can be natural to centre at an end-point of the error distribution rather 
than at the distribution's mean. Often these cases have an extreme-value character, and in that
broader context, examples involving meteorological, record-value and production-frontier data 
have been discussed in the literature. We shall discuss nonparametric methods for estimating 
regression curves in these settings, showing that they have features that contrast so starkly with 
those in better understood problems that they lead to apparent contradictions. For example, 
merely by centring errors at their end-points rather than their means the problem can change 
from one with a familiar nonparametric character, where the optimal convergence rate is slower
than $n^{-1/2}$, to one in the super-efficient class, where the optimal rate is faster than $n^{-1/2}$.
Moreover, when the errors are centred in a non-standard way there is greater intrinsic interest 
in estimating characteristics of the error distribution, as well as of the regression mean itself. 
The paper will also address this aspect of the problem. 

Keywords: bandwidth; curve estimation; extreme-value theory; jump discontinuity; kernel; 
local linear methods; local polynomial methods; nonparametric regression; smoothing; super
efficiency

1. Introduction 

The problem of estimating the end-point and tail shape of a distribution has a distin- 
guished history, not least because it provides important examples of non-regular be- 
haviour for various types of inference. See, for example, Barter and Moore (1965) and 
Smith (1985). The problem also has important practical motivations, arising in part from 
the prevalence of power-law distributions; see Zipf (1941, 1949). More recently, end-point 
and tail shape problems have been studied in regression settings; for example, in econometric
models for auctions.

The importance of end-point estimation to auction models and the consequent fact that 
statistical inference in such models is non-regular were first noted by Paarsch (1992) and 
Donald and Paarsch (1993). The end-point problem arises there because the distribution 
of bid price generally depends on all the parameters of the model, for instance, on pa- 
rameters that determine the costs of bidders. For particular examples of auction models, 
see Paarsch (1992) and Donald and Paarsch (2002).

Similar phenomena occur in truncated- or censored-regression models (e.g., Breen
(1996); Long (1997)), market-structure analysis (e.g., Robinson and Chiang (1996)) and 
inference for production frontiers in econometrics (e.g., Aigner et al. (1977); Park and 
Simar (1994); Hall and Park (2004)). There is a strong association between these fields 
and those where extreme-value methods are used; for example, the successful bid at an 
auction is the extremum of all bids. 

Although the term "regression" is commonly used in these settings, strictly speaking 
it is not correct. Since the error distribution is not centred at its expectation, the
"regression mean" no longer admits its conventional definition as the average of the response
variable given the value of the covariate, or explanatory, variable. This apparently
minor distinction can have a major impact, and, for example, can lead to an intriguing 
paradox, as we shall show shortly. 

In the context of auction models, Hirano and Porter (2003), Jofre-Bonet and Pesendorfer
(2003) and Chernozhukov and Hong (2004) studied parametric approaches to inference
about distribution end-points and jump heights. Campo et al. (2002) suggested a
semi-parametric technique. Related statistical work tends to be in the setting of paramet- 
ric regression; see, for example, Koenker et al. (1994), Smith (1994), Jureckova (2000), 
Portnoy and Jureckova (2000) and Knight (2001a). Knight (2001b) generalised several 
of the contributions of Smith (1994). In particular, in the context of curve estimation, 
he sketched the derivation of properties of estimators similar to those that we propose, 
although in cases where bias can be neglected and the shape parameter is fixed. 

However, it is feasible to take a nonparametric view of this problem, permitting a 
greater degree of flexibility and generality. For example, Korostelev and Tsybakov (1993) 
treated a variety of boundary estimation problems from a nonparametric viewpoint. 
However, their work was generally in settings where information was available on both 
sides of the boundary and where data did not become relatively sparse as the boundary 
was approached. Therefore, the convergence rates derived by Korostelev and Tsybakov 
(1993) were faster than those that we give in the present paper. Chernozhukov (1998) 
addressed properties of nonparametric estimators alternative to ours, and derived upper 
bounds to convergence rates. Chernozhukov (2005) discussed a related problem, where a 
concise linear model, rather than a relatively unspecified smooth function, supplied the 
basis for inference. Among the contributions made here, over and above Chernozhukov's 
work, we give (for the somewhat different estimators proposed here) the structure of 
limiting distributions, provide a concise first-order description of the bias-variance
trade-off, discuss empirical choice of bandwidth, and give a detailed account of statistical issues,
such as convergence rate paradoxes, for alternative, readily computable estimators. 



616 



P. Hall and I. Van Keilegom 



The present paper suggests nonparametric methodology, and describes its properties, 
in the context of inference about end-point and tail shape functions in nonparametric 
regression. In this case the errors, or disturbances, in the nonparametric model are centred 
at their end-points, rather than at their means. The end-points may be assumed to take 
a convenient value such as zero. Thus, the problem of estimating the nonparametric 
regression mean becomes that of adaptively estimating the centring function. 

Estimation of characteristics of the error distribution is sometimes also of practical 
interest. This problem can have several forms, depending on the extent of generality re- 
quired. For example, if the error distribution has a jump discontinuity at its end-point 
then the height of the jump can be treated nonparametrically, or modelled parametrically,
as a function of the explanatory variable. The end-point might be approached
in a polynomial way, and then the exponent, or degree, may be one of the subjects of 
inference. This paper will address those issues, too. 

The problem of nonparametric regression with end-point-centred errors also has signif- 
icant theoretical motivation. In particular, depending on the way in which the end-point 
is approached, substantially faster convergence rates can be achieved than in conventional
settings. For example, suppose we observe $Y_i = a(X_i) + \varepsilon_i$ for $1 \le i \le n$, where the errors
$\varepsilon_i$ are independent and identically distributed with a distribution that has a jump
discontinuity at one of its end-points and finite variance, and $a$ denotes a twice-differentiable
function. The estimator of $a$ given in this paper has root-mean-square convergence rate
$n^{-2/3}$, which beats even the rate $n^{-1/2}$ for a parametric setting, let alone the rate $n^{-2/5}$
for standard nonparametric regression with twice-differentiable functions. We shall show
that the rate $n^{-2/3}$ is minimax optimal.

However, it is well known that the rate $n^{-2/5}$ is also minimax optimal for estimating
the same function. How can this be? This paradox can be resolved by noting that the
two functions being estimated are not quite identical. They differ by a constant equal
to the difference, $\delta$, between the mean and the end-point of the error distribution. The
constant cannot be estimated at a faster rate than $n^{-2/5}$. However, this explanation is
not without its own element of surprise, since it might be thought that estimation of $\delta$
would be a semi-parametric rather than a nonparametric problem; if we could observe
the errors directly then we could estimate their end-point at rate $n^{-1}$ and their mean at
rate $n^{-1/2}$, both expressed in root-mean-square terms.
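
The two rates for directly observed errors can be checked numerically. The following is a minimal sketch, assuming unit-rate exponential errors (a density with a jump at the end-point zero); the function name `rms_errors`, the seed and the replication count are illustrative choices, not part of the paper. The sample minimum serves as the end-point estimator and the sample mean as the estimator of the mean.

```python
import math
import random

def rms_errors(n, reps=2000, seed=0):
    """Root-mean-square errors of the sample minimum (estimating the end-point 0)
    and of the sample mean (estimating the mean 1) for n directly observed
    Exp(1) errors. The Exp(1) law is an assumed example."""
    rng = random.Random(seed)
    se_min = 0.0
    se_mean = 0.0
    for _ in range(reps):
        eps = [rng.expovariate(1.0) for _ in range(n)]
        se_min += min(eps) ** 2                 # squared error of the end-point estimate
        se_mean += (sum(eps) / n - 1.0) ** 2    # squared error of the mean estimate
    return math.sqrt(se_min / reps), math.sqrt(se_mean / reps)

rms_min_100, rms_mean_100 = rms_errors(100)
rms_min_400, rms_mean_400 = rms_errors(400)
ratio_min = rms_min_100 / rms_min_400      # roughly 4 if the rate is 1/n
ratio_mean = rms_mean_100 / rms_mean_400   # roughly 2 if the rate is n**(-1/2)
```

Quadrupling $n$ should divide the first error by about 4 and the second by about 2, in line with the rates $n^{-1}$ and $n^{-1/2}$.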

2. Methodology 
2.1. Model 

Assume that data $(X_1, Y_1), \ldots, (X_n, Y_n)$ are generated by the model

$$Y_i = a(X_i) + \varepsilon_i, \tag{2.1}$$

where $a$ denotes a smooth function, each $X_i$ is a $p$-vector and each $Y_i$ is a scalar. It is
supposed that the distribution of the error, or disturbance, $\varepsilon_i$, conditional on $X_i = x$,
has density $f(\cdot\,|\,x)$ with the property that $f(u\,|\,x) = 0$ for $u < 0$ and

$$f(u\,|\,x) = b(x)\,c(x)\,u^{c(x)-1} + O\{u^{c(x)+d-1}\} \quad \text{as } u \downarrow 0, \tag{2.2}$$

where $0 < d < \infty$. The quantities $b$ and $c$ are smooth, strictly positive functions from $\mathbb{R}^p$
to $\mathbb{R}$. We wish to estimate $a$, and sometimes also $b$ and $c$.

Taking the view that the locus of points $(x, a(x))$ represents a boundary, models similar
to (2.1) and (2.2) have been treated before, although, with the exception of literature
discussed in Section 1, generally only when $p = 1$. In the latter case, and in the context of
statistics, contributions include those of Härdle et al. (1995), Hall et al. (1997, 1998), and
especially Gijbels and Peng (2000). There is also a vast econometrics literature in the case
$p = 1$, often including a shape constraint, such as convexity, in addition to smoothness
of a. See, for example, Korostelev et al. (1995a, 1995b), Kneip et al. (1998) and Gijbels 
et al. (1999). 

2.2. Nonparametric estimation of a 

Let $h > 0$ denote a bandwidth. Given $x \in \mathbb{R}^p$, let $S(x, h)$ be the set of pairs $(\alpha, \beta)$, where
$\alpha$ is a scalar and $\beta$ is a $p$-vector, such that $Y_i \ge \alpha + \beta^{T}(X_i - x)$ for all indices $i$ with
$\|X_i - x\| \le h$. Our initial estimator of $a(x)$ is

$$\tilde a(x) = \sup\{\alpha : (\alpha, \beta) \in S(x, h)\}. \tag{2.3}$$

Should there be very few or no indices such that $\|X_i - x\| \le h$, locally increase $h$ so that
a moderate number of $X_i$'s lie in this range.

The one-sided nature of inference in this problem raises interesting issues connected
with existence of the estimator and edge effects. To appreciate why, consider the case
where the points $X_i$, for $1 \le i \le n$, all lie in a $p$-variate half-space defined by an infinite
plane passing through $x$. Then there exists $\beta$ such that $\beta^{T}(X_i - x) < 0$ for $1 \le i \le n$.
Since the length of $\beta$ can be chosen arbitrarily large without altering the sign property,
$\tilde a(x)$ as defined at (2.3) equals $+\infty$.

Let $\mathcal{R}$ denote the support of the common density, $g_X$, of the $X_i$'s, and write $\partial\mathcal{R}$ for the
boundary of $\mathcal{R}$. If $g_X$ is continuous and positive in $\mathcal{R}$, and if $x$ is distant at least $sh$ (where
$s > 0$) from $\partial\mathcal{R}$, then the probability that $\tilde a(x) = +\infty$ converges to zero exponentially
fast, as a function of $n$, as the latter increases. See Section 5.1. However, if $x$ lies exactly
on $\partial\mathcal{R}$, then, depending on the shape of the boundary, the probability can equal 1, even
for finite $n$. Details are given in Section 3.1.

Arguably the simplest way of overcoming these difficulties is to set an upper bound,
$B$, on the largest value that $\tilde a(x)$ can take and estimate $a(x)$ by averaging $\tilde a(u)$ over
all values of $u$ for which $\|x - u\| \le h_1$ and $|\tilde a(u)| \le B$, where $h_1$ is another bandwidth.
We shall discuss this approach in the next paragraph. Another method, more difficult to
implement, is to distort the region of radius $h$ centred at $x$, within which $X_i$ must lie in
order for $(X_i, Y_i)$ to be used to construct $\tilde a(x)$, so that the region includes values of $X_i$
that are further than $h$ from $x$ and appropriately complement the values of $X_i$ that are
within $h$ of $x$.

One form that the averaging of $\tilde a(u)$ can take is based on local linear smoothing. There
we choose $\hat\alpha_1 = \alpha_1 \in \mathbb{R}$ and $\hat\beta_1 = \beta_1 \in \mathbb{R}^p$ to minimise

                                                                              (2.4)

where $K$ is a bounded, spherically symmetric probability density supported on the $p$-variate
unit sphere centred at the origin and $h_1$ is another bandwidth. Then we put
$\hat a(x) = \hat\alpha_1$.

Alternatively, we may define

$$\bar a(x) = \frac{\int_{\mathcal{R}(x)} \tilde a(x + h_1 u)\, I\{|\tilde a(x + h_1 u)| \le B\}\, K(u)\, du}{\int_{\mathcal{R}(x)} I\{|\tilde a(x + h_1 u)| \le B\}\, K(u)\, du}, \tag{2.5}$$

where $\mathcal{R}(x)$ denotes the set of points $u \in \mathbb{R}^p$ such that $x + h_1 u \in \mathcal{R}$, and $B$ is chosen
sufficiently large to ensure that the denominator in (2.5) is non-vanishing. Both these
approaches also overcome problems caused by discontinuities in the function $\tilde a$. While
both address the issue of boundary effects, the estimator $\hat a$ suffers less from boundary
bias than $\bar a$. In both $\hat a$ and $\bar a$ we may use a soft thresholding approach to inclusion of
values of $u$ for which $|\tilde a(x + u)| \le B$, rather than the hard thresholding suggested by (2.4)
and (2.5). Since $\hat a$ and $\bar a$ are based on local linear rather than local constant smoothing,
they enjoy good performance near boundaries; our theoretical analysis in Section 3 will
demonstrate this feature. General polynomial optimisation methods can also be employed
to estimate $a$, although at the expense of greater computational labour.

Plug-in methods can be used to choose the bandwidth, $h$, empirically. However, motivation
for that technique requires theory about large sample properties of $\tilde a$, and so
discussion of empirical bandwidth selection is deferred to Sections 2.4 and 3.2.

The local linear estimator introduced in the first paragraph of this section can be
viewed as based on a local, functional version of a linear programming algorithm. See
Smith (1994) and Portnoy and Jureckova (2000) for related methodologies. The more
general estimator, introduced in the paragraph above, requires polynomial programming
for implementation.

If $c(x)$ lies in the interval $(0, 2)$ then the rate of convergence of the estimator at (2.3)
cannot be improved. If $c(x) > 2$ then the rate of convergence can be enhanced by using
unboundedly many order statistics, where the number employed is a second smoothing
parameter (in addition to the bandwidth $h$) and its optimal choice depends on knowing,
or estimating, the main features of the remainder term in (2.2). See, for example, Hall
(1982) and Smith (1985). However, when $c(x) > 2$ the data are very sparse in the neighbourhood
of the boundary, and so inference about the remainder in (2.2) is especially
difficult. Therefore, the empirical challenges posed by this approach usually outweigh any
performance gains that might be achieved in practice, and so we shall not pursue such
methods.
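
For a scalar covariate ($p = 1$) the supremum defining the initial estimator at (2.3) can be evaluated without a general linear programming routine. The following is a minimal sketch under that assumption; the function name `initial_estimate` and the toy data are hypothetical and not taken from the paper. It relies on linear programming duality: the optimum is attained either at an observation with $X_i = x$ or on the chord joining one observation on each side of $x$, so the estimate is the smallest such height at $x$. When every design point in the window lies strictly on one side of $x$, the supremum is $+\infty$, which is the half-space phenomenon described in Section 2.2.

```python
import math

def initial_estimate(x, h, xs, ys):
    """Value of sup{alpha : Y_i >= alpha + beta*(X_i - x) whenever |X_i - x| <= h}
    for scalar X_i (the case p = 1). Returns math.inf when no finite value exists."""
    left, right, best = [], [], math.inf
    for xi, yi in zip(xs, ys):
        d = xi - x
        if abs(d) > h:
            continue                      # outside the window of radius h
        if d == 0.0:
            best = min(best, yi)          # constraint reduces to alpha <= Y_i
        elif d < 0.0:
            left.append((d, yi))
        else:
            right.append((d, yi))
    # Height at x of every chord joining a left point to a right point.
    for dl, yl in left:
        for dr, yr in right:
            weight = dr / (dr - dl)       # weight on the left point, in (0, 1)
            best = min(best, weight * yl + (1.0 - weight) * yr)
    return best

# Toy data: two points on the line y = 1 + 2x and two points above it.
xs = [0.40, 0.60, 0.50, 0.45]
ys = [1.80, 2.20, 3.00, 2.50]
estimate_interior = initial_estimate(0.5, 0.2, xs, ys)
estimate_one_sided = initial_estimate(0.3, 0.2, xs, ys)
```

The double loop costs of the order of the square of the number of points in the window, which is acceptable for a sketch; for $p > 1$ a linear programming solver would be needed instead.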
2.3. Nonparametric estimation of b and c

In principle, completely nonparametric methods may be used to estimate the functions 
b and c, although in practice one would often take c to be a constant, rather than a 
non-degenerate function of x. 

When estimating $b$ and $c$ we need not use the numerical value of $h$ employed for $\tilde a$.
However, in the brief account below we shall continue to use the notation $h$. Define the
residuals $\hat\varepsilon_i = Y_i - \tilde a(X_i)$, and let $T(x, h)$ denote the set of $\hat\varepsilon_i$'s for which $\hat\varepsilon_i > 0$ and
$\|X_i - x\| \le h$. Put $N_1 = \#T(x, h)$, and rank the elements of $T(x, h)$ as $0 < \hat\varepsilon_{(1)}(x, h) \le
\cdots \le \hat\varepsilon_{(N_1)}(x, h)$. Put



where r, another smoothing parameter, denotes a threshold. Optimal choice of bandwidth 
for estimating b and c is a highly complex matter, and will not be treated here. The 
estimators $\hat b$ and $\hat c$ can be thought of as local function versions of conditional maximum
likelihood estimators suggested by Hill (1975). 
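
As an illustration of the Hill (1975) idea, the following is a generic sketch, not the authors' exact definition, of conditional maximum likelihood estimation for a lower tail of the form $F(u) \approx b u^{c}$, which is what the density at (2.2) implies near zero. The function name `hill_lower_tail`, the particular normalisations and the synthetic data are assumptions made for illustration; `r` plays the role of the threshold, taken here to be the number of smallest positive residuals used.

```python
import math

def hill_lower_tail(residuals, r):
    """Hill-type estimates (c_hat, b_hat) for a lower tail F(u) ~ b * u**c,
    using the r smallest positive residuals. The normalisation is an assumption."""
    eps = sorted(e for e in residuals if e > 0)
    n1 = len(eps)
    if not 1 < r <= n1:
        raise ValueError("need 1 < r <= number of positive residuals")
    threshold = eps[r - 1]                          # r-th smallest residual
    log_ratios = sum(math.log(threshold / e) for e in eps[:r])
    c_hat = r / log_ratios                          # estimate of the tail exponent
    b_hat = (r / n1) * threshold ** (-c_hat)        # solves b * threshold**c = r / n1
    return c_hat, b_hat

# Synthetic residuals: exact quantiles of F(u) = u**1.5 on (0, 1), so b = 1 and c = 1.5.
n1 = 1000
residuals = [(i / (n1 + 1)) ** (1 / 1.5) for i in range(1, n1 + 1)]
c_hat, b_hat = hill_lower_tail(residuals, r=200)
```

In a local version one would pass only the residuals whose design points lie within $h$ of $x$, so that the estimates become functions of $x$.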

2.4. Outline of theoretical properties 

We shall show in Section 3 that, when constructing the local linear estimator $\tilde a$ and its
smoothed versions $\hat a$ and $\bar a$, it is generally optimal to choose $h \sim \text{const.}\, n^{-1/(p+2c)}$. In this
case the estimators have root-mean-square convergence rate $n^{-2/(p+2c)}$, when applied to
cases where $a$ has two derivatives. For very general choices of the error distribution, this
rate is optimal when $0 < c < 2$. Even if the functions $b$ and $c \in (0, 2)$ take known, constant
values, and we know the error distribution exactly (e.g., that it is gamma or Weibull),
the rate cannot be improved upon.

However, when $c > 2$, and we have sufficient information about the error distribution,
the convergence rate of estimators of $a$ can be improved by using other approaches. For
instance, if $b$ and $c$ are constant, and if the error density $f$ is known, then an estimator of
$a$ that is based on maximising a "local" version of log-likelihood can produce an estimator
that converges to $a$ at rate $n^{-2/(p+4)}$, rather than $n^{-2/(p+2c)}$, when $p > 2$ and $a$ has two
derivatives.

The problem is more awkward when the error distribution is not known. There, the
convergence rate $n^{-2/(p+2c)}$ can be close to optimal. In particular, if we know only that
the errors have a common density $f$, with $f(u) = b c u^{c-1} + O(u^{c+d-1})$ as $u \downarrow 0$, where
$b, c > 0$ are fixed constants, then the minimax optimal convergence rate of estimators of
$a$ is $n^{-2/(p+2c) - \delta(d)}$, where $\delta(d) > 0$ converges to zero as $d \downarrow 0$.





-c{x) 
3. Theoretical properties

3.1. Convergence rates of estimators of a 

Assume that data $(X_i, Y_i)$ are generated by the model at (2.1), where

    the pairs $(X_1, \varepsilon_1), (X_2, \varepsilon_2), \ldots$ are independent and identically distributed
    as $(X, \varepsilon)$; the density of $X$ is supported in a compact region,
    $\mathcal{R} \subset \mathbb{R}^p$, and is continuous and non-zero there; $P(\varepsilon > 0) = 1$;
    the distribution of $\varepsilon$, conditional on $X = x$, is absolutely continuous
    with a density, $f(\cdot\,|\,x)$, which satisfies (2.2) and, in the notation
    there, $d > 0$ is fixed, $b$ and $c$ are Hölder-continuous functions satisfying
    $C_1 \le b(x), c(x) \le C_2$ for all $x \in \mathcal{R}$, $C_1$ and $C_2$ are constants
    satisfying $0 < C_1 < C_2 < \infty$ and the remainder in (2.2) is of the order
    stated there, uniformly in $x \in \mathcal{R}$; and $\sup_x E(\varepsilon^{2+\eta}\,|\,X = x) < \infty$ for
    some $\eta > 0$.                                                          (3.1)

Recall from Section 2 that the one-sided nature of the inference problem means that
the estimator $\tilde a$ will often tend not to be defined at the boundary. However, $\tilde a$ may be
well-defined very close to the boundary. To elucidate this behaviour we shall consider
two types of $x$, described in (3.2) below. By way of notation, given a point $x_0$ in the
boundary $\partial\mathcal{R}$ of $\mathcal{R}$, let $v(x_0)$ denote the inward-pointing normal to the tangent plane at
$x_0$, which is well defined if $\partial\mathcal{R}$ has a continuously turning tangent in a neighbourhood
of $x_0$. Then we ask that:



    Either $x$ is fixed as an interior point of $\mathcal{R}$, or $x = x(n)$ is within
    order $h$ of $\partial\mathcal{R}$, in the following sense: Suppose that for some $x_0 \in \partial\mathcal{R}$,
    $\partial\mathcal{R}$ is of codimension 1 and has a continuously turning tangent in
    a neighbourhood of $x_0$, and that for some sufficiently small $\delta > 0$,
    $x_1 + v(x_1)t \in \mathcal{R}$ for all $x_1 \in \partial\mathcal{R}$ with $\|x_0 - x_1\| \le \delta$ and all $0 < t \le \delta$.
    Define $x = x(n) = x_0 + v(x_0)sh$, where $x_0 \in \partial\mathcal{R}$, $s > 0$ and $x_0$ and $s$
    are held fixed.                                                       (3.2)

If $x \in \mathcal{R}$ is an interior point, or if $x = x(n) = x_0 + v(x_0)sh$ where $x_0 \in \partial\mathcal{R}$ and $s \ge 1$,
let $\mathcal{U}(x)$ denote the closed, $p$-variate sphere of unit radius centred at $x$. If $x = x(n) =
x_0 + v(x_0)sh$ with $x_0 \in \partial\mathcal{R}$ and $0 < s < 1$, take $\mathcal{U}(x)$ to be the larger of the two parts of the
just-mentioned sphere that are obtained by cutting it by the plane that is perpendicularly
distant $s$ from the origin and has its normal in the direction $v(x)$, pointing towards the
centre of the sphere.

Let $\dot a$ and $\ddot a$ denote the $p$-vector of first derivatives and $p \times p$ matrix of second derivatives
of the function $a$ and suppose that

    the function $a$ has two continuous derivatives in $\mathcal{R}$, and if $x = x_0 +
    v(x_0)sh$ then $\partial\mathcal{R}$ has a continuously turning tangent plane at $x_0$.      (3.3)
Assume, too, that

    for some $0 < \eta < 1/(2p)$ and all sufficiently large $n$, $n^{\eta - (1/p)} \le h \le$        (3.4)



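Given $x \in \mathcal{R}$, let $E_1, E_2, \ldots$ denote independent, exponentially distributed random
variables (all with unit mean), write $\gamma$ for Euler's constant and define

$$Z_j(x) = \exp\Bigl[\frac{1}{c(x)}\Bigl\{\sum_{i=j}^{\infty} \frac{1 - E_i}{i} + \sum_{i=1}^{j-1} \frac{1}{i} - \gamma\Bigr\}\Bigr], \quad j \ge 1. \tag{3.5}$$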



Given $x \in \mathcal{R}$, let $U_1(x), U_2(x), \ldots$ be independent and identically distributed random
$p$-vectors, independent of the $Z_j(x)$'s and uniformly distributed on $\mathcal{U}(x)$. For $c_1, c_2 \ge 0$,
define

$$Q_1(c_1, c_2\,|\,x) = \sup_{\beta \in \mathbb{R}^p} \inf_{1 \le i < \infty} \bigl[c_1\{\beta^{T} U_i(x) + \tfrac12 U_i(x)^{T} \ddot a(x) U_i(x)\} + c_2\, b(x)^{-1/c(x)} Z_i(x)\bigr].$$

Note that $Q_1(1, 0\,|\,x)$ is a constant. In Theorem 1, below, this degenerate distribution is a
limit in the case where $\tilde a(x) - a(x)$ is asymptotically dominated by bias.

In the statement of Theorem 1 we let $x_1$ denote $x$ if $x$ is an interior point of $\mathcal{R}$, and
$x_1 = x_0$ if $x = x(n) = x_0 + v(x_0)sh$. Let $w(p)$ be the content of the $p$-variate unit sphere
(thus, $w(1) = 2$, $w(2) = \pi$), let $g_X(x)$ represent the value of the density of the distribution
of $X$ at $x$ and put $w_x = w(p) g_X(x_1)$. (To simplify notation we suppress the role of $x_1$
here.) We use a simpler rule than that in Section 2.2 to take care of cases where $\tilde a(x)$
is infinite. However, the last sentence in the theorem remains true if we define $\tilde a(x)$ to
equal zero whenever $|\tilde a(x)| > B$, provided $B > |a(x)|$.

Theorem 1. Assume (3.1)-(3.4). (a) If $(w_x n h^p)^{1/c(x_1)} h^2 \to \rho$, where $\rho \in [0, \infty)$, then
$(w_x n h^p)^{1/c(x_1)} \{\tilde a(x) - a(x)\} \to Q_1(\rho, 1\,|\,x_1)$ in distribution. (b) If $(n h^p)^{1/c(x_1)} h^2 \to \infty$
then $h^{-2} \{\tilde a(x) - a(x)\} \to Q_1(1, 0\,|\,x_1)$ in distribution. Furthermore, if we take the
precaution of defining $\tilde a(x)$ to equal an arbitrary but fixed constant in cases where it would
otherwise be infinite, then second moments converge to those of the limiting distributions.



Proofs of Theorems 1-3 will be given in Section 5. It is crucial, in condition (3.2),
that we take $s > 0$ rather than $s \ge 0$. If $s = 0$ then $x$ lies right on the boundary of $\mathcal{R}$,
and in such cases the theorem is false. For example, if $\mathcal{R}$ is a convex region with a
smooth boundary, such as a sphere, then with probability 1, $\tilde a(x_0) = \infty$ for all $x_0 \in \partial\mathcal{R}$.
However, it follows from the theorem that for points $x$ that are arbitrarily close to $\partial\mathcal{R}$,
on the scale of the bandwidth, without being right on the boundary, the probability that
$\tilde a(x)$ is finite converges to 1, and in fact the estimator $\tilde a(x)$ attains optimal convergence
rates. The main impact of the boundary is to reduce the number of design points used 
to construct the estimator. This tends to inflate estimator variance, although only by a 
constant factor (determined by the geometry of the boundary in the vicinity of x). The 
difficulty can be alleviated by increasing the bandwidth in such places. 
Asymptotic properties of $\hat a$ and $\bar a$ are similar, except that the limiting distribution of $\hat a$
is more tedious to define. Therefore we shall confine ourselves to $\bar a$. To further abbreviate
our treatment we shall restrict attention to the case where

    $x$ is an interior point of $\mathcal{R}$, $h_1 \sim th$ for a fixed constant $t > 0$, and
    $(w_x n h^p)^{1/c(x)} h^2 \to \rho \in [0, \infty)$.                                      (3.6)

Let $Z_1, Z_2, \ldots$ be as at (3.5); for simplicity we drop the argument $x$. Re-define $U_1, U_2, \ldots$
to be independent of one another and of the $Z_j$'s and uniformly distributed in the
$p$-variate sphere of radius $t + 1$ centred at $x$. Given a $p$-vector $u$ with $\|u\| \le t$, let
$(S_1(u), T_1(u)), (S_2(u), T_2(u)), \ldots$ denote the values $(U_{i_1(u)}, Z_{i_1(u)}), (U_{i_2(u)}, Z_{i_2(u)}), \ldots$ of
$(U_i, Z_i) \equiv (U_i, Z_i(x))$ for which $\|U_i - u\| \le h$, ordered such that $Z_{i_1(u)} < Z_{i_2(u)} < \cdots$.
With $\kappa = p^{-1} \int \|u\|^2 K(u)\, du$, $\nabla^2$ denoting the Laplacian operator and $\rho \ge 0$ as in (3.6),
define

$$Q_2(u\,|\,x) = \sup_{\beta \in \mathbb{R}^p} \inf_{1 \le i < \infty} \bigl[\rho\{\beta^{T} S_i(u) + \tfrac12 S_i(u)^{T} \ddot a(x) S_i(u)\} + \{(t+1)^p b(x)\}^{-1/c(x)} T_i(u)\bigr],$$

$$Q_3(x) = \tfrac12 \rho\, t^2 \kappa\, (\nabla^2 a)(x) + \int Q_2(u\,|\,x) K(u)\, du.$$

Under conditions (3.1)-(3.6), and taking $B > |a(x)|$ in (2.5), it can be shown that with
probability $1 - O(n^{-C})$ for all $C > 0$, the estimator $\bar a(x)$ at (2.5) satisfies

$$\bar a(x) = \int \tilde a(x + h_1 u) K(u)\, du. \tag{3.7}$$

Theorem 2 applies with equal validity to the estimators at (2.5) and (3.7). Nevertheless,
the estimator on the right-hand side of (3.7) does not enjoy the boundedness property
that partly motivated $\bar a$ at (2.5).



Theorem 2. Assume (3.1)-(3.6), and that the kernel $K$ used to define $\bar a(x)$ is a bounded,
spherically symmetric probability density supported on the unit sphere centred at the origin.
Then $(w_x n h^p)^{1/c(x)} \{\bar a(x) - a(x)\} \to Q_3(x)$ in distribution. Furthermore, if in the
integrand at (3.7) we take the precaution of defining $\tilde a(x + h_1 u)$ to equal an arbitrary
but fixed constant in cases where it would otherwise be infinite, then the second moment
converges to that of the limiting distribution.



3.2. Choice of bandwidth 



Theorems 1 and 2 imply that, except in pathological cases where $\ddot a(x) = 0$, the optimal
convergence rate of $\tilde a(x)$ and $\bar a(x)$ to $a(x)$ is achieved by choosing the bandwidth $h$ so
that $(nh^p)^{-1/c(x)}$ and $h^2$ are of the same size and, in particular, $h \sim \text{const.}\, n^{-1/\{p+2c(x)\}}$.
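
The balance just described can be made explicit. Setting $(nh^p)^{-1/c} = h^2$ gives $h^{2+p/c} = n^{-1/c}$, that is, $h = n^{-1/(p+2c)}$, with resulting error of order $h^2 = n^{-2/(p+2c)}$. A minimal sketch of this arithmetic follows; the helper name `rate_exponents` is hypothetical, and $c$ is treated as a constant rather than a function of $x$.

```python
from fractions import Fraction

def rate_exponents(p, c):
    """Exponents (kappa_h, kappa_rate) such that h ~ n**(-kappa_h) balances
    (n * h**p)**(-1/c) against h**2, giving error of order n**(-kappa_rate)."""
    p, c = Fraction(p), Fraction(c)
    kappa_h = 1 / (p + 2 * c)
    kappa_rate = 2 * kappa_h
    # Both terms have the same order: (1 - p * kappa_h) / c equals 2 * kappa_h.
    assert (1 - p * kappa_h) / c == kappa_rate
    return kappa_h, kappa_rate

jump_case = rate_exponents(p=1, c=1)                    # error density with a jump
bivariate_case = rate_exponents(p=2, c=Fraction(3, 2))  # p = 2, c = 3/2
```

For $p = 1$ and $c = 1$ this returns exponents $1/3$ and $2/3$, matching the rate $n^{-2/3}$ quoted in Section 1 for errors whose distribution has a jump at the end-point.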
If $x$ does not lie on the boundary of $\mathcal{R}$, and if $(w_x n h^p)^{1/c(x)} h^2 \to \rho \in [0, \infty)$, then the
asymptotic mean squared error of $\tilde a(x)$ is given by

$$\tau(\rho\,|\,x) = E\Bigl(\sup_{\beta \in \mathbb{R}^p} \inf_{1 \le i < \infty} \bigl[\rho\{\beta^{T} U_i + \tfrac12 U_i^{T} \ddot a(x) U_i\} + b(x)^{-1/c(x)} Z_i(x)\bigr]\Bigr)^2, \tag{3.8}$$

where $U_1, U_2, \ldots$ are uniformly distributed on the unit sphere centred at $x$, $Z_1(x), Z_2(x), \ldots$
are defined at (3.5) and the $U_i$'s and $Z_i(x)$'s are completely independent. Therefore, if
$h = w_x^{-1/\{p+2c(x)\}} \rho^{1/\{2+p/c(x)\}} n^{-1/\{p+2c(x)\}}$ then $\rho$ should ideally be chosen as

$$\rho_0(x) = \operatorname{argmin}_{\rho}\, \tau(\rho\,|\,x). \tag{3.9}$$

One technique for estimating $\ddot a(x)$ is to twice numerically differentiate a heavily
smoothed version of $\tilde a$. A simpler approach, if we may make the assumption (A), say,
that, for each $i$, the distribution of $\varepsilon_i$ does not depend on $X_i$, is to pass a traditional
smoother through the data $(X_i, Y_i)$, which estimates the value of $\mu(x) = E(Y\,|\,X = x)$. Under
(A), this quantity differs from $a$ only by a constant, and so $\ddot a = \ddot\mu$. The latter function
can be estimated using conventional cubic smoothing. This approach is attractive even
if the distribution of $\varepsilon_i$ depends to some extent on $X_i$, since it gives a working empirical
approximation to $\ddot a$.

Methods for estimating $b(x)$ and $c(x)$ were discussed in Section 2.3. Substituting these
estimators for the true values of $\ddot a(x)$, $b(x)$, $c(x)$ and $p(h, x)$ in (3.8), we may compute an
estimator $\hat\tau(\rho\,|\,x)$ of $\tau(\rho\,|\,x)$ using a Monte Carlo simulation, which leads to an estimator
$\hat\rho_0(x)$ of $\rho_0(x)$ at (3.9). The density of $X$ at $x$, that is, $g_X(x)$, can be estimated more
conventionally, and thus an estimator $\hat w_x$ of $w_x = w(p) g_X(x)$ can be constructed. An
empirical bandwidth selector is then given by

$$\hat h(x) = \hat w_x^{-1/\{p+2\hat c(x)\}}\, \hat\rho_0(x)^{1/\{2+p/\hat c(x)\}}\, n^{-1/\{p+2\hat c(x)\}}. \tag{3.10}$$

In many circumstances it is feasible to take $c(x)$ to be a constant, not depending on
$x$. Then a global approach to bandwidth choice is possible, as follows. We shall proceed
as though the density $g_X$ is constant; if it is not, using its average value rather than
attempting to accommodate its variation greatly simplifies matters. Thus, we take $\hat w$
to be an estimator of the average value of $w_x$. The mean integrated squared error of
$\tilde a(x)$ is asymptotic to $\tau(\rho) = \int_{\mathcal{R}} \tau(\rho\,|\,x)\, dx$, of which an estimator is $\hat\tau(\rho) = \int_{\mathcal{R}} \hat\tau(\rho\,|\,x)\, dx$,
leading to an estimator $\hat\rho_0 = \operatorname{argmin}_{\rho} \hat\tau(\rho)$ of $\rho_0 = \operatorname{argmin}_{\rho} \tau(\rho)$. A global bandwidth for
constructing $\tilde a$ is thus $\hat h = \hat w^{-1/(p+2\hat c)}\, \hat\rho_0^{1/(2+p/\hat c)}\, n^{-1/(p+2\hat c)}$.

3.3. Optimality 

We shall show in this section that the convergence rates implied by Theorems 1 and 2, and
also lower bounds of the same orders, are available uniformly over classes $\mathcal{A}$ of functions
$a$ with two bounded derivatives. The possibility that either the proportionality constant,
$b$, or the exponent, $c$, varies with the design variable, $X_i$, is not relevant to discussion of
the lower bound, and for this reason, for simplicity, and since our lower bound results are
stronger if we narrow the class of error distributions for which worst-case performance
is achieved, we shall take the distribution of $\varepsilon = \varepsilon_i$ to be a single, specific one, say the
gamma:

$$f(u) = \frac{1}{\Gamma(c)}\, u^{c-1} e^{-u}, \quad \text{where } c > 0 \text{ is fixed}. \tag{3.11}$$

In the lower bound calculations, $c > 0$ will be assumed known.

Likewise, we shall treat just one distribution of $X = X_i$ and one region $\mathcal{R}$. In particular,
writing $V(x, r)$ for the closed sphere centred at $x$ and of radius $r > 0$, we shall assume
that

$$\mathcal{R} = V(x_0, 1) \text{ and } X \text{ is uniformly distributed on } \mathcal{R}. \tag{3.12}$$

Given $C > 0$, let $\mathcal{A} = \mathcal{A}(C)$ denote the class of functions $a$ for which first and second
derivatives exist and are bounded absolutely by $C$, let $\check{\mathcal{A}}$ denote the class of bounded
functions $\check a$ of the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ (the latter generated as at (2.1)) and let
$\mathcal{R}_h$ be the set of all points in $\mathcal{R}$ that are distant at least $h$ from $\partial\mathcal{R}$.

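Theorem 3. Assume (3.11) and (3.12) and, when constructing $\tilde a(x)$, let $h \sim \text{const.} \times
n^{-1/(p+2c)}$, except that we take $\tilde a(x)$ equal to an arbitrary but fixed constant in cases
where it would otherwise be infinite. Then,

$$\sup_{a \in \mathcal{A}} \sup_{x \in \mathcal{R}_h} E\{\tilde a(x) - a(x)\}^2 = O\{n^{-2/(p+2c)}\} \tag{3.13}$$

as $n \to \infty$. Furthermore, if $0 < c < 2$,

$$\liminf_{n \to \infty} n^{2/(p+2c)} \inf_{\check a \in \check{\mathcal{A}}} \sup_{a \in \mathcal{A}} E\{\check a(x) - a(x)\}^2 > 0 \quad \text{for each } x \in \mathcal{R} \setminus \partial\mathcal{R}, \tag{3.14}$$

$$\liminf_{n \to \infty} n^{2/(p+2c)} \inf_{\check a \in \check{\mathcal{A}}} \sup_{a \in \mathcal{A}} \int E\{\check a(x) - a(x)\}^2\, dx > 0. \tag{3.15}$$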

Together, (3.13)-(3.15) imply that the estimator $\tilde a$ achieves the minimax optimal rate,
$n^{-2/(p+2c)}$, uniformly over all functions $a \in \mathcal{A}$, and that the optimality can be expressed
in either a local or a global sense. Similarly, it may be proved that if $\mathcal{A}_q$ is taken to be
the class of functions $a$ with $q + 1$ (rather than 2) bounded derivatives, then the $q$th
degree local polynomial approach discussed in Section 2.2 achieves the minimax optimal
convergence rate of $n^{-(q+1)/\{p+(q+1)c\}}$ uniformly over functions in $\mathcal{A}_q$. The upper bound (3.13)
continues to hold if the class $\mathcal{A}$ is increased to include a range of distributions of $\varepsilon$ for
which the lower tail of the distribution function decreases like $u^{c}$ as $u \downarrow 0$, and a range
of distributions of design points for which the density is bounded away from zero on $\mathcal{R}$.
4. Numerical properties
4.1. Simulations 

Consider independent and identically distributed data $(X_i, Y_i)$ $(1 \le i \le n)$ satisfying the
model $Y_i = a(X_i) + \varepsilon_i$ given in (2.1). The covariate $X_i$ has a uniform distribution on the
interval $[0, 1]$. We consider three models for $a(x)$ $(0 \le x \le 1)$:

    Model 1: $a(x) = 10(x - a_0)^3$,      $a_0 = 0.25, 0.5$,
    Model 2: $a(x) = \exp(-a_0 x^2)$,     $a_0 = 1, 2$,                  (4.1)
    Model 3: $a(x) = a_0 \cos(\pi x)$,    $a_0 = 0.25, 0.5$.

Figure 1 shows the graphs of these six frontier functions. The error $\varepsilon_i$ is taken from a
Gamma distribution:

$$f(u\,|\,x) = \frac{1}{s(x)^{c}\, \Gamma(c)}\, u^{c-1} \exp\{-u/s(x)\}$$

$(u > 0)$, where $c > 0$ and $s(x) = 1 + 2x$. Note that this density is of the general type (2.2),
with $b(x) = \{c\, s(x)^{c}\, \Gamma(c)\}^{-1}$.
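
A minimal sketch of this data-generating mechanism, using only the Python standard library, is given below. It draws from Model 3 with $a_0 = 0.5$; the function name `simulate_model3`, the seed and the default arguments are hypothetical choices for illustration and are not taken from the paper.

```python
import math
import random

def simulate_model3(n, a0=0.5, c=1.0, seed=1):
    """Draw (X_i, Y_i), i = 1..n, with X_i ~ Uniform[0, 1] and
    Y_i = a0*cos(pi*X_i) + eps_i, where eps_i given X_i = x is Gamma with
    shape c and scale s(x) = 1 + 2x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        frontier = a0 * math.cos(math.pi * x)
        eps = rng.gammavariate(c, 1.0 + 2.0 * x)   # (shape, scale) parametrisation
        data.append((x, frontier + eps))
    return data

sample = simulate_model3(200)
# Every response lies on or above the frontier, since the errors are positive.
above = all(y >= 0.5 * math.cos(math.pi * x) for x, y in sample)
```

Because the errors are centred at their end-point zero, the curve $a$ is the lower boundary of the scatter rather than its mean.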

Figure 1. Graphs of the functions $a(x)$ given in (4.1): the left figure shows $a(x)$ for Model 1
($a_0 = 0.25$ (thin curve) and $a_0 = 0.50$ (thick curve)), the right figure shows $a(x)$ for Model 2
($a_0 = 1$ (thin solid curve) and $a_0 = 2$ (thick solid curve)) and Model 3 ($a_0 = 0.25$ (thin dashed
curve) and $a_0 = 0.50$ (thick dashed curve)).

We carry out two simulation studies. In the first study, we investigate the performance
of the estimator $\hat a(x)$ and of its data-driven bandwidth selector described in Section 3.2.
The simulations are based on 100 arbitrary samples of size $n = 200$ and $n = 400$. For each
sample we estimate $a(x)$ at $x = 0.5$. The scaling parameter $c$ is chosen to be 0.5, 1 or 1.5.
These three values of $c$ are such that, as $u \downarrow 0$, $f(u \mid x) \to \infty$, $f(u \mid x) \to s(x)^{-1}$ and
$f(u \mid x) \to 0$, respectively. We use local linear smoothing to obtain both $\hat a(x)$ and $\tilde a(x)$.
The bandwidth $h$ is calculated from formula (3.10) and we have taken $h_1 = h$. To estimate
$\ddot a(x)$ we work (as explained in Section 3.2) under the working model that the distribution
of $\varepsilon_i$ does not depend on $X_i$, in which case $\ddot a(x)$ equals the second derivative of the
regression function $E(Y \mid X = x)$. This second derivative is estimated using local cubic
smoothing, with bandwidth 0.25. The functions $b(x)$ and $c(x) = c$ are estimated employing
the procedure explained in Section 2.3, where $r$ equals the smallest integer larger than
0.907Vi and the bandwidth for estimating $b(x)$ and $c(x)$ is chosen as 0.25. The kernel used
throughout is the biquadratic kernel, $K(u) = (15/16)(1 - u^2)^2 I(|u| \le 1)$.
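As a quick sanity check on the kernel, the sketch below evaluates the biquadratic kernel and confirms numerically that it integrates to one; it also reports the second moment, which is the kind of constant that typically enters bias terms of kernel smoothers. The helper `midpoint_integral` is a hypothetical utility written for this illustration.

```python
def biquadratic_kernel(u):
    """K(u) = (15/16) * (1 - u^2)^2 for |u| <= 1, and 0 otherwise."""
    return (15.0 / 16.0) * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def midpoint_integral(f, lo, hi, m=20000):
    """Midpoint-rule approximation to the integral of f over [lo, hi]."""
    step = (hi - lo) / m
    return step * sum(f(lo + (k + 0.5) * step) for k in range(m))


total_mass = midpoint_integral(biquadratic_kernel, -1.0, 1.0)
second_moment = midpoint_integral(
    lambda u: u * u * biquadratic_kernel(u), -1.0, 1.0
)
print(total_mass, second_moment)
```

Exact integration gives total mass 1 and second moment 1/7, which the numerical values should reproduce to several decimal places.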

Tables 1 and 2 show the estimated bias, variance and mean square error (MSE) of $\hat a(x)$
at $x = 0.5$ for each of the considered models, as well as the average value of the bandwidth
$h$ over the 100 simulation runs, obtained using a Monte Carlo simulation of formula (3.10).
Note that the functions $a(x)$ considered in this simulation study are neither convex
nor concave. In fact, our method imposes neither condition, in contradistinction to, for
instance, the DEA (data envelopment analysis) estimator, which requires the function
$a(x)$ to be convex.

The tables show that the MSE increases when $c$ increases, which is to be expected since
the higher the value of $c$, the smaller the density $f(\cdot \mid x)$ of the error close to the frontier.



Table 1. Monte Carlo simulations for n = 200, with optimal bandwidth given by (3.10)

Model  a0    c    Mean(h)  10 × Bias  100 × Var  100 × MSE
1      0.25  0.5  0.045      0.156      0.073      0.098
             1    0.071      1.408      1.244      3.226
             1.5  0.096      3.177      1.905     11.995
       0.5   0.5  0.067     -0.169      0.049      0.078
             1    0.094      0.798      0.707      1.344
             1.5  0.120      2.389      1.832      7.540
2      1     0.5  0.064     -0.010      0.023      0.023
             1    0.087      0.899      0.812      1.619
             1.5  0.119      2.406      1.954      7.741
       2     0.5  0.051      0.079      0.033      0.039
             1    0.079      1.072      1.032      2.181
             1.5  0.111      2.551      1.795      8.300
3      0.25  0.5  0.062      0.009      0.022      0.022
             1    0.086      0.933      0.908      1.778
             1.5  0.113      2.534      1.899      8.319
       0.5   0.5  0.059     -0.041      0.031      0.033
             1    0.083      1.002      1.093      2.098
             1.5  0.111      2.542      1.924      8.388

Table 2. Monte Carlo simulations for n = 400, with optimal bandwidth given by (3.10)

Model  a0    c    Mean(h)  10 × Bias  100 × Var  100 × MSE
1      0.25  0.5  0.033      0.019      0.029      0.029
             1    0.057      0.799      0.365      1.003
             1.5  0.087      2.175      0.838      5.566
       0.5   0.5  0.053     -0.208      0.053      0.097
             1    0.097      0.233      0.429      0.483
             1.5  0.111      1.573      0.944      3.417
2      1     0.5  0.047     -0.039      0.012      0.013
             1    0.089      0.446      0.373      0.572
             1.5  0.108      1.657      0.927      3.674
       2     0.5  0.036      0.018      0.017      0.017
             1    0.075      0.561      0.366      0.680
             1.5  0.105      1.723      0.856      3.823
3      0.25  0.5  0.047     -0.029      0.011      0.012
             1    0.084      0.485      0.360      0.595
             1.5  0.107      1.691      0.921      3.782
       0.5   0.5  0.045     -0.087      0.027      0.034
             1    0.078      0.478      0.384      0.613
             1.5  0.106      1.674      0.909      3.712



The sparser the data near the frontier, the harder the frontier is to estimate. These
findings also agree with the theoretical results of Section 3. This sparsity of data close
to the frontier affects especially the bias of the estimator, since it is clear that the
estimator $\hat a(x)$ tends to overestimate $a(x)$ whenever there are few observations near the
boundary. It also affects, as would be expected, the performance of the empirical bandwidth
selector. The higher the value of $c$, the harder it becomes to estimate the optimal bandwidth
accurately. Finally, comparing Tables 1 and 2 we see that both the bias and the variance
decrease as the sample size increases.
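The scaled bias, variance and MSE columns of Tables 1 and 2 can be cross-checked against the usual decomposition MSE = Bias² + Var. Reading the column labels as 10 × Bias, 100 × Var and 100 × MSE (an interpretation made here, which the arithmetic supports), the identity becomes 100 × MSE = (10 × Bias)² + 100 × Var. The helper name `scaled_mse` is hypothetical.

```python
def scaled_mse(ten_bias, hundred_var):
    """Return 100*MSE from 10*Bias and 100*Var, using MSE = Bias^2 + Var."""
    return ten_bias ** 2 + hundred_var


# (10*Bias, 100*Var, reported 100*MSE) for a few rows of Tables 1 and 2.
rows = [
    (1.408, 1.244, 3.226),  # Table 1, Model 1, a0 = 0.25, c = 1
    (0.798, 0.707, 1.344),  # Table 1, Model 1, a0 = 0.5,  c = 1
    (2.389, 1.832, 7.540),  # Table 1, Model 1, a0 = 0.5,  c = 1.5
    (0.799, 0.365, 1.003),  # Table 2, Model 1, a0 = 0.25, c = 1
]
for ten_bias, hundred_var, reported in rows:
    print(round(scaled_mse(ten_bias, hundred_var), 3), reported)
```

Small discrepancies in the third decimal place are to be expected, since the tabulated entries are themselves rounded.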

In the second simulation study, the estimator $\hat a(x)$ is compared with

$$a^*(x) = \sup\{Y_i : \|X_i - x\| \le h\}.$$

This estimator has the advantage of being simpler than that developed in Section 2,
although it can be shown to have the inferior convergence rate of $n^{-1/(p+c)}$, rather than
$n^{-2/(p+2c)}$. As for the first study, 100 arbitrary samples of size $n = 200$ and $n = 400$ are
generated. We set $x = 0.5$ and $c = 0.5$, 1 or 1.5. In order to make a fair comparison of
the two competitors, the bandwidth $h$ of either method is selected by minimising in a
deterministic way (i.e., minimising over the 100 samples) the MSE of each estimator.
When constructing $\hat a(x)$, the choice of $h_1$ and $K$ and the estimation of $\ddot a(x)$, $b(x)$ and
$c$ are done as for the first study. It is found that, in 29 out of the 36 cases treated in
Tables 1 and 2, the mean square error of $\hat a(x)$ is less than that of $a^*(x)$, and that the
median value of the ratio of mean square errors equals 0.21.
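The two rates $n^{-1/(p+c)}$ and $n^{-2/(p+2c)}$ can be motivated by a heuristic balance between bias and stochastic error. The following is only a sketch, not the formal argument: it assumes that the stochastic error of a frontier estimator built from the roughly $nh^p$ design points within $h$ of $x$ is of order $(nh^p)^{-1/c}$, that the simpler estimator $a^*$ has bias of order $h$, and that the local linear estimator has bias of order $h^2$.

```latex
% Simpler estimator a^*: bias ~ h, stochastic error ~ (n h^p)^{-1/c}
h \asymp (n h^{p})^{-1/c}
  \;\Longrightarrow\; h^{p+c} \asymp n^{-1}
  \;\Longrightarrow\; h \asymp n^{-1/(p+c)},
  \qquad \text{error} \asymp h \asymp n^{-1/(p+c)}.

% Local linear estimator: bias ~ h^2, stochastic error ~ (n h^p)^{-1/c}
h^{2} \asymp (n h^{p})^{-1/c}
  \;\Longrightarrow\; h^{p+2c} \asymp n^{-1}
  \;\Longrightarrow\; h \asymp n^{-1/(p+2c)},
  \qquad \text{error} \asymp h^{2} \asymp n^{-2/(p+2c)}.
```

Since $2/(p+2c) > 1/(p+c)$ is equivalent to $2(p+c) > p+2c$, that is to $p > 0$, the local linear rate is always the faster of the two.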

4.2. Data analysis 

We consider data on 123 American electric utility companies, studied by Christensen and
Greene (1976), Greene (1990) and Hall and Simar (2002), among others. We focus here
on the relation between $Y_i = -\log(C_i/P_i)$ and $X_i = \log(Q_i)$, where $C_i$ is the cost, $Q_i$ the
output and $P_i$ the price of fuel for each company. We fit the model

$$Y_i = a(X_i) + \varepsilon_i,$$

where it is assumed that the conditional density of the errors $\varepsilon_i$ satisfies relation (2.2).
The scatterplot of the data, together with the estimated frontier curve $\hat a(x)$, is shown in
Figure 2. We restrict the region of estimation to [4.6, 11.2], to avoid estimation in sparse
areas of $X$. Both $\hat a(x)$ and $\tilde a(x)$ are estimated using local linear smoothing.
At each point of an equispaced grid of 34 values between 4.6 and 11.2 we estimate the
bandwidth $h = h_1$ from formula (3.8), yielding values in the range from 0.77 to 1.31. The
bandwidth for estimating $\ddot a(x)$, $b(x)$ and $c(x)$ is chosen as one-fifth of the total range,
namely 1.32, whereas to estimate the design density we use kernel estimation based on
the normal reference rule. The kernel used throughout is again the biquadratic kernel.

Figure 2. Scatterplot of the American Electric Utility Data (horizontal axis: log(output)).
The observations are represented by circles, the solid curve is the estimated 'regression'
(frontier) curve.
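The grid and the fixed bandwidth quoted for the data analysis follow from simple arithmetic on the estimation region [4.6, 11.2]. The sketch below reproduces them; the variable names (in particular `pilot_bandwidth`) are hypothetical labels introduced here, not terminology from the analysis itself.

```python
lo, hi, n_grid = 4.6, 11.2, 34

# Spacing between neighbouring points of the equispaced grid.
step = (hi - lo) / (n_grid - 1)
grid = [lo + k * step for k in range(n_grid)]

# One-fifth of the total range of the estimation region.
pilot_bandwidth = (hi - lo) / 5.0

print(round(step, 3), round(pilot_bandwidth, 3), len(grid))
```

The grid spacing works out to 0.2 and the fixed bandwidth to 1.32, so the data-driven bandwidths of 0.77 to 1.31 each span several grid points.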

Figure 2 suggests that a linear model is appropriate for these data. However, it is 
particularly satisfying to reach that conclusion using a highly adaptive method that does 
not impose linearity, or even convexity, as a prior assumption. 

5. Technical arguments 
5.1. Proof of Theorem 1 

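To simplify notation we shall assume that $W_x = 1$ throughout; this can always be achieved
via a change of scale. For brevity we shall deal only with the case where $x$ is an interior
point of $\mathcal{R}$. Put $\gamma_\alpha(x) = a(x) - \alpha$, a scalar, and $\gamma_\beta(x) = h^{-1}\{\dot a(x) - \beta\}$, a $p$-vector. Let
$\mathcal{I}(x,h)$ denote the set of indices $i$ such that $\|X_i - x\| \le h$ and for $i \in \mathcal{I}(x,h)$ define
$V_i = h^{-1}(X_i - x)$. In this notation,

$$Y_i - \alpha - \beta^{\mathrm T}(X_i - x) = \gamma_\alpha(x) + h^2\{\gamma_\beta(x)^{\mathrm T} V_i + \tfrac12 V_i^{\mathrm T}\ddot a(x) V_i\} + h^2 R_i(x) + \varepsilon_i,$$

where the remainder, $R_i(x)$, has the property that

$$\sup_{x \in \mathcal{R}}\ \sup_{i \in \mathcal{I}(x,h)} |R_i(x)| \le R(h) = h^{-2} \sup_{x \in \mathcal{R}}\ \sup_{u : \|u\| \le 1,\ x + hu \in \mathcal{R}} |a(x + hu) - a(x) - h u^{\mathrm T}\dot a(x) - \tfrac12 h^2 u^{\mathrm T}\ddot a(x) u|, \tag{5.1}$$

and $R(h) \to 0$ as $h \to 0$.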

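In particular, asking that $Y_i \ge \alpha + \beta^{\mathrm T}(X_i - x)$ for all indices $i \in \mathcal{I}(x,h)$ is equivalent
to insisting that

$$\gamma_\alpha + \inf_{i \in \mathcal{I}(x,h)} \{h^2(\gamma_\beta^{\mathrm T} V_i + \tfrac12 V_i^{\mathrm T}\ddot a V_i) + h^2 R_i + \varepsilon_i\} \ge 0, \tag{5.2}$$

where we have dropped the argument from $\gamma_\alpha(x)$, $\gamma_\beta(x)$, $\ddot a(x)$ and $R_i(x)$. Let $\mathcal{S}_1(x,h)$
denote the set of pairs $(\gamma_\alpha, \gamma_\beta)$ such that (5.2) holds, and let $\gamma_1$ denote the infimum of
$\gamma_\alpha$ over $(\gamma_\alpha, \gamma_\beta) \in \mathcal{S}_1(x,h)$. Then, $\hat a(x) = a(x) - \gamma_1$.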
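The constrained problem behind this argument, namely finding the largest intercept $\alpha$ such that $Y_i \ge \alpha + \beta^{\mathrm T}(X_i - x)$ for all design points within $h$ of $x$, can be illustrated for a single covariate ($p = 1$). The sketch below is a brute-force illustration under that reading of the estimator, not the authors' algorithm; the function name `local_linear_frontier` and the linear test frontier $a(x) = 1 + 2x$ are assumptions made here.

```python
import math
import random


def local_linear_frontier(xs, ys, x0, h):
    """Largest alpha such that Y_i >= alpha + beta*(X_i - x0) for every
    design point with |X_i - x0| <= h (one-dimensional case, p = 1)."""
    pts = [(x - x0, y) for x, y in zip(xs, ys) if abs(x - x0) <= h]
    if not (any(d < 0 for d, _ in pts) and any(d > 0 for d, _ in pts)):
        raise ValueError("need design points on both sides of x0 within h")
    best = -math.inf
    # The optimal slope is a breakpoint of beta -> min_i (Y_i - beta*d_i),
    # so it suffices to try the slopes of lines through pairs of local points.
    for i, (di, yi) in enumerate(pts):
        for dj, yj in pts[i + 1:]:
            if di == dj:
                continue
            beta = (yj - yi) / (dj - di)
            alpha = min(y - beta * d for d, y in pts)
            best = max(best, alpha)
    return best


# Illustration: linear frontier a(x) = 1 + 2x with Gamma(1, 1) errors.
rng = random.Random(1)
xs = [rng.random() for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gammavariate(1.0, 1.0) for x in xs]
print(local_linear_frontier(xs, ys, x0=0.5, h=0.2))
```

For a linear frontier the true line is itself feasible, so the estimate cannot fall below $a(x_0) = 2$ in this example; the gap above 2 reflects how few observations happen to lie near the boundary. For $p > 1$ the same problem is a linear programme in $(\alpha, \beta)$.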
It follows from this result and (5.1) that, defining

$$\gamma_2 = \gamma_2(x) = \sup_{\gamma \in \mathbb{R}^p}\ \inf_{i \in \mathcal{I}(x,h)} \{h^2(\gamma^{\mathrm T} V_i + \tfrac12 V_i^{\mathrm T}\ddot a V_i) + \varepsilon_i\} \tag{5.3}$$

and noting that, for any random variable $A$, $\operatorname{ess\,sup} A$ is the infimum of constants $C$ for
which $P(A \le C) = 1$, we have:

$$\operatorname{ess\,sup}\ \sup_{x} |\hat a(x) - a(x) - \gamma_2| \to 0. \tag{5.4}$$

Defining $N = N(x,h) = \#\mathcal{I}(x,h)$, we may write $\gamma_2$ equivalently as

$$\gamma_2 = (nh^p)^{-1/c(x)} \sup_{\gamma \in \mathbb{R}^p}\ \inf_{1 \le i \le N} \{\rho_1(\gamma^{\mathrm T} V_{(i)} + \tfrac12 V_{(i)}^{\mathrm T}\ddot a V_{(i)}) + \xi_{(i)}\},$$

where $\rho_1 = (nh^p)^{1/c(x)} h^2$, $\xi_{(1)} \le \xi_{(2)} \le \cdots$ are the ordered values of $(nh^p)^{1/c(x)}\varepsilon_i$ for $i \in \mathcal{I}$
and $V_{(1)}, V_{(2)}, \ldots$ denote the concomitant values of $V_1, V_2, \ldots$. The factor $(nh^p)^{1/c(x)}$ here
reflects the fact that the expected number of values $X_i$ that lie within $h$ of $x$ is asymptotic
to a constant multiple of $nh^p$. The power, $1/c(x)$, is appropriate because it is the power
applied to sample size to describe the scale of the largest value when the sample is drawn
from a Pareto-type distribution with exponent $c(x)$.
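The role of the power $1/c$ can be seen in a short calculation. The sketch below assumes that the error distribution function satisfies $F(u) \sim b\,u^{c}$ as $u \downarrow 0$ (the form implied by the Gamma example of Section 4.1), suppresses the dependence on $x$, and writes $m$ for a generic sample size; these simplifications are made here for illustration only.

```latex
% Smallest of m i.i.d. errors, rescaled by m^{1/c}, assuming F(u) ~ b u^c as u -> 0+
P\Bigl\{ m^{1/c} \min_{1 \le i \le m} \varepsilon_i > z \Bigr\}
  = \bigl\{ 1 - F(z\,m^{-1/c}) \bigr\}^{m}
  = \bigl\{ 1 - b z^{c} m^{-1} + o(m^{-1}) \bigr\}^{m}
  \longrightarrow \exp(-b z^{c}), \qquad z > 0 .
```

Thus the smallest error is of size $m^{-1/c}$; taking $m$ of order $nh^p$, the number of design points near $x$, gives the normalisation $(nh^p)^{1/c}$. Equivalently, the reciprocals $\varepsilon_i^{-1}$ have a Pareto-type upper tail with exponent $c$, and their largest value is of size $m^{1/c}$.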
For each $r \ge 1$,

the limiting joint distribution of $\xi_{(1)}, \ldots, \xi_{(r)}$ and $V_{(1)}, \ldots, V_{(r)}$ is the
distribution of $b(x)^{-1/c(x)}(Z_1, \ldots, Z_r)$ and $U_1, \ldots, U_r$, where the sequence
$Z_1, Z_2, \ldots$ is as defined at (3.5) and, independently of the $Z_j$'s,                (5.5)
the $U_j$'s are uniformly distributed in the unit $p$-variate sphere.

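(See Hall (1978).) Moreover, with probability 1, for any interval $[a, b]$ where $0 < a < b <
\infty$, the suprema over $\rho_1 \in [a, b]$ of the values of $\|\gamma\|$ and $i \ge 1$ at which the extremum

$$\sup_{\gamma \in \mathbb{R}^p}\ \inf_{1 \le i < \infty} [\rho_1\{\gamma^{\mathrm T} U_i + \tfrac12 U_i^{\mathrm T}\ddot a(x) U_i\} + b(x)^{-1/c(x)} Z_i]$$

is achieved are finite and, using (5.5), the same can be proved of the extremum

$$\sup_{\gamma \in \mathbb{R}^p}\ \inf_{1 \le i < \infty} \{\rho_1(\gamma^{\mathrm T} V_{(i)} + \tfrac12 V_{(i)}^{\mathrm T}\ddot a V_{(i)}) + \xi_{(i)}\}.$$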

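These properties and (5.5) imply that, if $\rho_1 \to \rho \in (0, \infty)$ as $n \to \infty$,

$$(nh^p)^{1/c(x)} \gamma_2 \to \sup_{\beta \in \mathbb{R}^p}\ \inf_{1 \le i < \infty} [\rho\{\beta^{\mathrm T} U_i + \tfrac12 U_i^{\mathrm T}\ddot a(x) U_i\} + b(x)^{-1/c(x)} Z_i]$$

in distribution. The part of Theorem 1 pertaining to $\rho_1 \to \rho \in (0, \infty)$ follows from this
property and (5.4).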

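If $\rho_1 \to 0$ then, since $\xi_{(1)} \to b(x)^{-1/c(x)} Z_1$ in distribution, we have $(nh^p)^{1/c(x)}\gamma_2 \to
b(x)^{-1/c(x)} Z_1$ in distribution. And if $\rho_1 \to \infty$ then

$$h^{-2}\gamma_2 \to \sup_{\beta \in \mathbb{R}^p}\ \inf_{1 \le i < \infty} \{\beta^{\mathrm T} U_i + \tfrac12 U_i^{\mathrm T}\ddot a(x) U_i\} = \sup_{\beta \in \mathbb{R}^p}\ \inf_{\|u\| \le 1} \{\beta^{\mathrm T} u + \tfrac12 u^{\mathrm T}\ddot a(x) u\},$$

a constant. Parts (a) and (b) of Theorem 1 are consequences of these properties.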
To establish convergence of second moments it suffices, in view of (5.4), to prove that
for some $\eta_1 > 0$,

there exist random variables $A_1$ and $A_2$ such that $A_1 \le
(nh^p)^{1/c(x)}\gamma_2 \le A_2$ with probability 1, and $E(|A_j|^{2+\eta_1})$ is uniformly          (5.6)
bounded for $j = 1, 2$.

A proof of (5.6) is given in a longer version of this paper, obtainable from the authors.

5.2. Proof of Theorem 2 

(Recall that we assume that $W_x = 1$.) We shall work with the definition (3.7) of $\tilde a(x)$.
Defining $\gamma_2(x)$ as at (5.3), and noting (5.4), we have:

$$\tilde a(x) = \int a(x + h_1 u)K(u)\,du + \int \gamma_2(x + h_1 u)K(u)\,du + o_p(h^2). \tag{5.7}$$

The first integral on the right-hand side, $I_1(x)$, equals $a(x) + h^2 g(x) + o(h^2)$, where $g(x) =
\tfrac12 t^2 \kappa (\nabla^2 a)(x)$, whence it follows that $(nh^p)^{1/c(x)}\{I_1(x) - a(x)\} \to \rho g(x)$. The stochastic
process $S(u) = (nh^p)^{1/c(x)}\gamma_2(x + h_1 u)$ converges weakly to $Q_2(u) = Q_2(u \mid x)$ (see below),
whence it follows that the second integral, $I_2(x)$, on the right-hand side of (5.7), satisfies
$(nh^p)^{1/c(x)} I_2(x) \to \int Q_2(u)K(u)\,du$.

To appreciate why the finite-dimensional distributions of $S$ converge to those of $Q_2$,
consider the marked point process in $\mathbb{R}^p$, where the $i$th point is $V_i = h^{-1}(X_i - x)$ and
the associated mark is $\zeta_i = \{nh^p(t + 1)^p\}^{1/c(x)}\varepsilon_i$. Only the marked points that lie in the
disc of radius $t + 1$, centred at $x$, contribute to $\tilde a(x)$, and so we confine attention to those.
Define $\zeta_{(1)} \le \zeta_{(2)} \le \cdots$ to be the ordered values of $\zeta_1, \zeta_2, \ldots$ and let $V_{(1)}, V_{(2)}, \ldots$ be
the concomitant values of $V_1, V_2, \ldots$. In this new notation, (5.5) continues to hold. From
that result it follows, using the argument in the paragraph containing (5.5), that for
each finite set $u_1, \ldots, u_k$ in the sphere of radius $t + 1$, centred at $x$, the joint distribution
of $S(u_1), \ldots, S(u_k)$ converges to that of $Q_2(u_1), \ldots, Q_2(u_k)$. Tightness of the stochastic
process $S$ can be proved using the fact that, defining

$$D(u, j_0) = \sup_{\gamma \in \mathbb{R}^p}\ \inf_{1 \le j \le j_0} \{\rho_1(\gamma^{\mathrm T} V_{(i_j(u))} + \tfrac12 V_{(i_j(u))}^{\mathrm T}\ddot a V_{(i_j(u))}) + (1 + t)^{-p/c(x)}\zeta_{(i_j(u))}\},$$

where the ordering $i_1(u), i_2(u), \ldots$ is such that $\zeta_{(i_1(u))} \le \zeta_{(i_2(u))} \le \cdots$ among all indices
$i(u)$ such that $\|V_{(i(u))} - u\| \le 1$, the process $D(\cdot, j_0)$ decreases with increasing $j_0$.

5.3. Proof of Theorem 3 

Derivation of (3.13) is similar to that of the last part of Theorem 1, and so will not be 
given here. We shall outline proofs of (3.14) and (3.15). 

In the case of (3.14), take $a(x) = \delta^2\psi(x/\delta)$ where $\delta = n^{-1/(p+2c)}$ and $\psi$ is a spherically
symmetric function supported on $\mathcal{V}(0, \tfrac12)$ with bounded derivatives of first and
second orders, all of them dominated by $\tfrac12 C$. Then, $a \in \mathcal{A}$. Consider the problem of
discriminating between the models (a) $Y_i = \varepsilon_i$ and (b) $Y_i = a(X_i) + \varepsilon_i$ using only the
data $(X_1, Y_1), \ldots, (X_n, Y_n)$. The likelihood-ratio approach, which in view of the Neyman-
Pearson lemma is optimal, is to decide in favour of model (b) if and only if the ratio

$$L = \prod_{i=1}^{n} [f\{Y_i - a(X_i)\}/f(Y_i)]$$

exceeds an appropriate critical point. Here, $f$ is the density at (3.11). If $Y_i = I a(X_i) + \varepsilon_i$,
where $I = 0$ or 1 in cases (a) or (b), respectively, then

$$\log L = (c - 1)\sum_{i=1}^{n} \log\{1 - Y_i^{-1} a(X_i)\} + \sum_{j=1}^{n} \cdots$$

Hence, the likelihood-ratio rule involves deciding in favour of (b) if and only if the sum
$\ell = \sum_i \log\{1 - Y_i^{-1} a(X_i)\}$ exceeds a critical point.

Asymptotically correct discrimination is readily seen to be impossible if $\nu = n\delta^p$
is bounded; this quantity is of the same order as the number of pairs $(X_i, Y_i)$ for
$1 \le i \le n$ that contain information about $a$. The theorem will follow if we show that,
when $\nu = n\delta^p \to \infty$ but $\delta = o(n^{-1/(p+2c)})$ along a subsequence, the probability of correct
discrimination using the likelihood-ratio rule when cases (a) and (b) above both have
prior probability $\tfrac12$ converges to $\tfrac12$; it is assumed that all calculations are done for the
subsequence.

We may Taylor-expand $\ell$, showing that $\ell/\delta^2 = \ell_1 + (I - \tfrac12)\ell_2 + \ell_3$, where $\ell_1 =
-\sum_i \psi_i\varepsilon_i^{-1}$, $\ell_2 = \delta^2\sum_i \psi_i^2\varepsilon_i^{-2}$, $\psi_i = \psi(X_i/\delta)$ and, when $\nu \to \infty$ and $\delta = o(n^{-1/(p+2c)})$,
the remainder, $\ell_3$, equals $o_p(|\ell_1| + |\ell_2|)$. Using the fact that $0 < c < 2$ it may be proved
that $\nu^{-1/c}\ell_1$ has a limiting, symmetric, non-degenerate stable distribution with exponent
$c$, and $\delta^{-2}\nu^{-2/c}\ell_2$ has a limiting, positive, non-degenerate stable law with exponent $c/2$.
Therefore, if $\delta = o(n^{-1/(p+2c)})$ then $\ell_2 = o_p(\ell_1)$, from which it follows that the probability
of correct classification using the likelihood-ratio rule converges to $\tfrac12$.
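The expansion of the log-likelihood-ratio statistic can be checked term by term. The following sketch assumes $\ell = \sum_i \log\{1 - Y_i^{-1}a(X_i)\}$ with $a(X_i) = \delta^2\psi_i$, $\psi_i = \psi(X_i/\delta)$ and $Y_i = I\delta^2\psi_i + \varepsilon_i$, and keeps terms up to order $\delta^4$; the auxiliary symbol $z_i$ is introduced here for convenience, and control of the remainder is omitted.

```latex
Y_i^{-1} = \varepsilon_i^{-1} - I\,\delta^{2}\psi_i\,\varepsilon_i^{-2} + \dots, \qquad
z_i \equiv Y_i^{-1} a(X_i)
    = \delta^{2}\psi_i\,\varepsilon_i^{-1} - I\,\delta^{4}\psi_i^{2}\,\varepsilon_i^{-2} + \dots

\log(1 - z_i) = -z_i - \tfrac12 z_i^{2} - \dots
    = -\delta^{2}\psi_i\,\varepsilon_i^{-1}
      + \bigl(I - \tfrac12\bigr)\,\delta^{4}\psi_i^{2}\,\varepsilon_i^{-2} + \dots

\ell/\delta^{2} = \underbrace{-\sum_i \psi_i\,\varepsilon_i^{-1}}_{\ell_1}
    + \bigl(I - \tfrac12\bigr)\,
      \underbrace{\delta^{2}\sum_i \psi_i^{2}\,\varepsilon_i^{-2}}_{\ell_2}
    + \ell_3 .
```

Given the stable-law scalings stated in the proof, $\ell_1$ is of size $\nu^{1/c}$ and $\ell_2$ of size $\delta^2\nu^{2/c}$, so their ratio $\ell_2/\ell_1$ is of order $\delta^2\nu^{1/c} = (n\delta^{p+2c})^{1/c}$ when $\nu = n\delta^p$. This tends to zero exactly when $\delta = o(n^{-1/(p+2c)})$, in which case the only term that depends on $I$ is asymptotically negligible.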

To obtain (3.15), let $W$ denote the cube of diameter 2 inscribed within $\mathcal{V}(0, 1)$, with
its sides parallel to the coordinate axes. Place into $W$ a rectangular grid of points,
$x_1, \ldots, x_N$, with nearest neighbours exactly $\delta$ apart and no point distant less than $\tfrac12\delta$
from the boundary of $\mathcal{V}(0, 1)$. We may take $N \sim \text{const.}\,\delta^{-p}$ as $\delta \to 0$. Define $a_I(x) =
\delta^2\sum_i I_i\psi\{(x - x_i)/\delta\}$, where $I = (I_1, \ldots, I_N)$ is a vector of 0's and 1's. Then $a_I \in \mathcal{A}$ for
each choice of $I$. Since $\psi$ vanishes outside radius $\tfrac12$ from the origin, for each $x$ no more
than one term in this series is non-zero. Treating the problem of estimating $a_I$ on $\mathcal{R}_h$ as
one of discriminating between $I_i = 0$ and $I_i = 1$ for each $i$ such that the sphere of radius
$\tfrac12\delta$ centred at $x_i$ intersects $\mathcal{R}_h$ and arguing as in the proof of (3.14) we may derive (3.15).

6. Conclusion 

We have shown that an alternative "regression" problem, where errors are "positioned"
at their end-points, leads to estimators with properties quite different from their
counterparts in conventional problems. In particular, if the error density is bounded away
from zero then relatively fast convergence rates are possible, even if the regression mean is
known only up to smoothness conditions. Results of this type can be compared with their
counterparts in $L_1$ or $L_2$ regression, where convergence rates using either method can
be faster depending primarily on properties of the error distribution. Particularly if the
main object were to obtain an idea of the shape of the regression mean, rather than for
formal prediction, it would be appropriate to exploit these dissimilarities and construct
a function estimator that enjoyed good convergence rates. Potential future problems of
interest include developing adaptive methods for choosing among different regression
methods, suitable for different error types, so as to ensure good empirical performance.